Obesity is a growing problem in America and the goal of my project is to focus on obesity in the Philadelphia region and to try to determine what factors influence obesity. I will be applying statistical and machine learning analysis to the Southeastern Pennsylvania Household Survey results from 2015 to determine what factors are the greatest predictors for obesity. You can get access to my Github repository here.
I conceptualized my project after speaking with three faculty members: Andria Johnson, Rebecca Hubbard, and Ariana Chao. Andria Johnson is a Professor in the Department of History and Sociology of Science and recommended that I look at health disparities in the local Philadelphia region. Rebecca Hubbard is an Associate Professor of Biostatistics in the Department of Biostatistics, Epidemiology and Informatics and suggested using machine learning techniques to create a model to predict health outcomes and determine source variability. Ariana Chao is an Assistant Professor of Nursing and reccomended some datasets and discussed the impact of mental health on BMI and other health outcomes as a possibile route.
The obesity epidemic in America is getting worse and current initiatives are not as effective as we’d hope. As of 2015-2016, about four in 10 U.S. adults were obese, up from 37.7 percent during 2013-2014. The news for children and teens isn’t much better. Overall, nearly 19 percent were obese in 2015-2016, up from about 17 percent during the previous two years. Obesity a serious concern because it’s a risk factor for many health conditions, such as diabetes, heart disease, stroke and even some kinds of cancer. Federal, state and local health policymakers need to continue campaigns that promote good nutrition and exercise, the food and beverage industry need to increase the supply of affordable, healthy, nutritious foods and fewer sugary drinks, and consumers need to demand healthier products and policies in their communities. I’m interested in looking at obesity in the Philadelphia region and used the Southeastern Pennsylvania Household Survey results from 2015 to determine what factors are the greatest predictors for obesity in hopes of influencing health initiatives and directing effects to subpopulations that are affected the most.
This problem interdiciplinary as it spans several fields. There is a large data science aspect since the analysis is performed through coding in R. The problem itself draws insights from epidemiology, public health and policy, and sociology. Since healthcare policies are so influential, it’s vital that we have adequate data and effective analysis in order to inform these decisions. Datasets will need to be properly cleaned and transformed, making note of missing values. Additionally, being able to apply context to the obtained results and drawing the correct conclusions is crucial in best representing the population.
The data I am analyzing is from the 2014-2015 Southeastern Pennsylvania Household Health Survey with results from adults and children combined. After loading in the data, variables will be converted to factor or numeric accordingly. Finally, interesting variables will be filtered and cleaned up for analysis.
library(haven)
library(tidyverse)
# read in raw data
raw.data <- read_sav(url("https://raw.githubusercontent.com/joycehelenwang/BMIN503_Final_Project/master/HS15COM1b.sav"))
# convert to factors
data <- as_factor(raw.data)
# clean up variables
data$NUMADULT <- as.numeric(as.character(data$NUMADULT))
table(data$RACE2)##
## White (Not-Latino) Black (Not-Latino) Latino (total) Asian Biracial/Multi Native American Other
## 9024 2673 829 255 349 49 5
levels(data$RACE2) <- c("White", "Black", "Latino", "Asian", "Multi", "Native American", "Other")First, I wanted to look at the breakdown of obesity of weight categoties in the Philadelphia region. The obesity rate for adults is about 31% and for children is about 17% - both less than the national average.
# obesity
obesity <- data %>%
select(Obesity = OBESITY2, Respondant = FLAGCOM) %>%
na.omit
ggplot(obesity, aes(Obesity)) +
geom_bar(fill = "skyblue2") +
labs(y = "Count")table(obesity)## Respondant
## Obesity Adult Child
## Underweight 135 179
## Normal 3292 1488
## Overweight 3499 352
## Obese 3051 405
Then I looked at overall demographics.
## demographics
# county
county <- data %>%
select(Obesity = OBESITY2, County = COUNTY) %>%
na.omit
ggplot(county, aes(County)) +
geom_bar(fill = "skyblue2") +
labs(y = "Count")ggplot(county, aes(County)) +
geom_bar(position = "fill", aes(fill = Obesity)) +
labs(y = "Proportion")# income
income <- data %>%
select(Income = INCOME) %>%
na.omit
ggplot(income, aes(Income)) +
geom_bar(aes(fill=Income)) +
labs(y = "Count") +
theme(legend.text=element_text(size=5),
axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank())income2 <- data %>%
select(Obesity = OBESITY2, Income = INCOME) %>%
na.omit
ggplot(income2, aes(Obesity)) +
geom_bar(position = "fill", aes(fill = Income)) +
labs(y = "Proportion") +
theme(legend.text=element_text(size=4))# sex and age
age <- data %>%
select(Obesity = OBESITY2, Age = AGE) %>%
na.omit
ggplot(age, aes(Obesity, Age)) +
geom_boxplot(fill = "skyblue2")ggplot(age, aes(Age)) +
geom_bar(aes(fill = Obesity), position = "fill") +
labs(y = "Proportion")sex <- data %>%
select(Obesity = OBESITY2, Sex = SEX) %>%
na.omit
ggplot(sex, aes(Sex)) +
geom_bar(aes(fill = Obesity)) +
labs(y = "Count")# race
race <- data %>%
select(Obesity = OBESITY2, Race = RACE2) %>%
na.omit
ggplot(race, aes(Race)) +
geom_bar(fill = "skyblue2") +
labs(y = "Count")ggplot(race, aes(Race)) +
geom_bar(aes(fill = Obesity), position = "fill") +
labs(y = "Proportion")# household size
kids <- data %>%
select(Obesity = OBESITY2, Kids = TOTKIDS) %>%
na.omit
ggplot(kids, aes(Kids)) +
geom_bar(fill = "skyblue2") +
labs(y = "Count")adults <- data %>%
select(Obesity = OBESITY2, Adults = NUMADULT) %>%
na.omit
ggplot(adults, aes(Adults)) +
geom_bar(fill = "skyblue2") +
labs(y = "Count")size <- data %>%
select(Obesity = OBESITY2, Kids = TOTKIDS, Adults = NUMADULT) %>%
na.omit %>%
mutate(Household = Kids + Adults)
ggplot(size, aes(Household)) +
geom_bar(fill = "skyblue2") +
labs(y = "Count")ggplot(size, aes(Obesity, Household)) +
geom_boxplot(fill = "skyblue2") +
labs(y = "Count")# employment
employment <- data %>%
select(Obesity = OBESITY2, Employment = MAINEMPL) %>%
na.omit
ggplot(employment, aes(Employment)) +
geom_bar(fill = "skyblue2") +
theme(axis.text.x = element_text(angle=45, hjust=1)) +
labs(y = "Count")ggplot(employment, aes(Employment)) +
geom_bar(aes(fill = Obesity), position = "fill") +
labs(y = "Proportion") +
theme(axis.text.x = element_text(angle=45, hjust=1))# education
education <- data %>%
select(Obesity = OBESITY2, Education = RSPGRAD2) %>%
na.omit
ggplot(education, aes(Education)) +
geom_bar(fill = "skyblue2") +
theme(axis.text.x = element_text(angle=45, hjust=1)) +
labs(y = "Count")ggplot(education, aes(Education)) +
geom_bar(aes(fill = Obesity), position = "fill") +
labs(y = "Proportion") +
theme(axis.text.x = element_text(angle=45, hjust=1))# marital status
marital <- data %>%
select(Obesity = OBESITY2, Marital = RESPMAR) %>%
na.omit
ggplot(marital, aes(Marital)) +
geom_bar(fill = "skyblue2") +
theme(axis.text.x = element_text(angle=45, hjust=1)) +
labs(y = "Count")ggplot(marital, aes(Marital)) +
geom_bar(aes(fill = Obesity), position = "fill") +
labs(y = "Proportion") +
theme(axis.text.x = element_text(angle=45, hjust=1))# sexual identity
sexident <- data %>%
select(Obesity = OBESITY2, SexIdent = SEXIDENT) %>%
na.omit
ggplot(sexident, aes(SexIdent)) +
geom_bar(fill = "skyblue2") +
theme(axis.text.x = element_text(angle=45, hjust=1)) +
labs(x = "Sexual Idendity", y = "Count")ggplot(sexident, aes(SexIdent)) +
geom_bar(aes(fill = Obesity), position = "fill") +
labs(x = "Sexual Identity", y = "Proportion") +
theme(axis.text.x = element_text(angle=45, hjust=1))Interesting trends I noted from the graphs:
Next, I wanted to explore if access to care and insurance type influenced obesity.
library(reshape2)
# insurance rate
insurance.data <- data %>%
select(Obesity = OBESITY2, Insured = INSURED) %>%
na.omit
ggplot(insurance.data, aes(Obesity)) +
geom_bar(aes(fill = Insured), position = "fill") +
labs(y = "Proportion")# usual go
usual <- data %>%
select(Obesity = OBESITY2, Usual = USUALGO) %>%
na.omit
ggplot(usual, aes(Usual)) +
geom_bar(fill = "skyblue2") +
theme(axis.text.x = element_text(angle=45, hjust=1)) +
labs(y = "Count")ggplot(usual, aes(Usual)) +
geom_bar(aes(fill = Obesity), position = "fill") +
labs(y = "Proportion") +
theme(axis.text.x = element_text(angle=45, hjust=1))# types of insurance - count number of yes
work <- data %>%
group_by(OBESITY2) %>%
count(WORKCOM) %>%
na.omit %>%
filter(WORKCOM == "Yes") %>%
select(Obesity = OBESITY2, Work = n)
self <- data %>%
group_by(OBESITY2) %>%
count(SELFCOM) %>%
na.omit %>%
filter(SELFCOM == "Yes") %>%
select(Obesity = OBESITY2, Self = n)
meda <- data %>%
group_by(OBESITY2) %>%
count(MEDACOM) %>%
na.omit %>%
filter(MEDACOM == "Yes") %>%
select(Obesity = OBESITY2, MedA = n)
medb <- data %>%
group_by(OBESITY2) %>%
count(MEDBCOM) %>%
na.omit %>%
filter(MEDBCOM == "Yes") %>%
select(Obesity = OBESITY2, MedB = n)
ma <- data %>%
group_by(OBESITY2) %>%
count(MACOM) %>%
na.omit %>%
filter(MACOM == "Yes") %>%
select(Obesity = OBESITY2, MA = n)
va <- data %>%
group_by(OBESITY2) %>%
count(VACOM) %>%
na.omit %>%
filter(VACOM == "Yes") %>%
select(Obesity = OBESITY2, VA = n)
other <- data %>%
group_by(OBESITY2) %>%
count(OTHERCOM) %>%
na.omit %>%
filter(OTHERCOM == "Yes") %>%
select(Obesity = OBESITY2, Other = n)
insurance.type <- list(work, self, meda, medb, ma, va, other) %>%
reduce(full_join, by = "Obesity")
insurance <- melt(insurance.type, id.vars='Obesity')
ggplot(insurance, aes(Obesity, value)) +
geom_bar(stat = "identity", position = "fill", aes(fill=variable)) +
labs(y = "Proportion") +
scale_fill_discrete(name = "Insurance Type")ggplot(insurance, aes(variable, value)) +
geom_bar(stat = "identity", fill = "skyblue2") +
labs(x = "Insurance Type", y = "Count") +
scale_fill_discrete(name = "Obesity")ggplot(insurance, aes(variable, value)) +
geom_bar(stat = "identity", position = "fill", aes(fill=Obesity)) +
labs(x = "Insurance Type", y = "Proportion") +
scale_fill_discrete(name = "Obesity")Interesting trends I noted from the graphs:
Next, I looked at other health outcomes to see if there was any relationship to obesity.
## health outcomes
# overall health
health <- data %>%
select(Obesity = OBESITY2, Health = HEALTH) %>%
na.omit
ggplot(health, aes(Health)) +
geom_bar(fill = "skyblue2") +
labs(y = "Count")ggplot(health, aes(Health)) +
geom_bar(aes(fill = Obesity), position = "fill") +
labs(y = "Proportion")# dentist
dentist <- data %>%
select(Obesity = OBESITY2, Dentist = DENTIST) %>%
na.omit
ggplot(dentist, aes(Obesity)) +
geom_bar(position = "fill", aes(fill = Dentist)) +
labs(y = "Proportion")# smoking
smoke <- data %>%
select(Obesity = OBESITY2, Smoking = SMOKHOME) %>%
na.omit
ggplot(smoke, aes(Obesity)) +
geom_bar(position = "fill", aes(fill = Smoking)) +
labs(y = "Proportion")# asthma
asthma <- data %>%
group_by(OBESITY2) %>%
count(EVRASTH) %>%
na.omit %>%
rename(Obesity = OBESITY2, Asthma = EVRASTH, Count = n)
ggplot(asthma, aes(Obesity, Count)) +
geom_bar(stat = "identity", position = "fill", aes(fill = Asthma)) +
labs(y = "Proportion")Interesting trends I noted:
After performing exploratory data analysis, I wanted to use machine learning techniques to determine the highest predictors of obesity and create a prediction model. I selected relevent variables that revolved around demographics, access to care, and health outcomes since I believed those would have the most impact on obesity. Then I classified “Non-Obese” as the original Underweight, Normal, and Overweight categories. I used random forest and glm for my models.
library(randomForest)
library(pROC)
library(data.table)
# select relevant data (demographics, access to care, health outcomes)
obesity.data <- data %>%
select(Obesity = OBESITY2, County = COUNTY, Sex = SEX, Age = AGE, Kids = TOTKIDS, Adults = NUMADULT, Health = HEALTH, Asthma = EVRASTH, Usual = USUALGO, Work = WORKCOM, Self = SELFCOM, MedA = MEDACOM, MedB = MEDBCOM, MA = MACOM, VA = VACOM, Other = OTHERCOM, Insured = INSURED, Dentist = DENTIST, Obesity = OBESITY2, Smoking = SMOKHOME, Employment = MAINEMPL, Education = RSPGRAD2, Marital = RESPMAR, SexIdent = SEXIDENT, Race = RACE2, Income = INCOME) %>%
filter(complete.cases(.)) %>%
mutate(Obesity = fct_recode(Obesity, "Non-Obese" = "Underweight")) %>%
mutate(Obesity = fct_recode(Obesity, "Non-Obese" = "Normal")) %>%
mutate(Obesity = fct_recode(Obesity, "Non-Obese" = "Overweight"))
# random forest
obesity.rf <- randomForest(Obesity ~ ., data = obesity.data, importance = TRUE)
importance <- data.frame(obesity.rf$importance)
gini <- importance %>%
rownames_to_column("variable") %>%
arrange(desc(MeanDecreaseGini))
head(gini, n = 10)## variable Non.Obese Obese MeanDecreaseAccuracy MeanDecreaseGini
## 1 Income 0.027296467 -0.0143836078 0.015618309 498.4078
## 2 Age 0.014508513 0.0041712781 0.011608246 419.3644
## 3 Health 0.016359415 0.0355829868 0.021734274 264.4766
## 4 Education 0.013260941 -0.0028532365 0.008745387 194.7179
## 5 County 0.001779015 -0.0001010282 0.001252389 186.8185
## 6 Marital 0.007474865 -0.0045213082 0.004115743 155.8901
## 7 Kids 0.007674388 -0.0049985880 0.004124509 137.7265
## 8 Adults 0.003490343 -0.0016439285 0.002048496 137.6220
## 9 Employment 0.007332163 -0.0041518782 0.004118659 112.4854
## 10 Race 0.008544940 -0.0009852724 0.005880289 108.4254
The top ten variables from random forest were:
obesity.rf.top <- randomForest(Obesity ~ Income + Age + Health + Education + County + Marital + Adults + Kids + Employment + Race, data = obesity.data, importance = TRUE)
rf.pred <- predict(obesity.rf.top, obesity.data, type="prob")
# glm
obesity.glm <- glm(Obesity ~ ., data = obesity.data, family = binomial(logit))
coef <- summary(obesity.glm)[12]
coef.sort <- as.data.frame(coef)
coef.sort <- setDT(coef.sort, keep.rownames = TRUE)[]
names(coef.sort) <- c("Variable", "Estimate","SE","tval","pval")
coef.sort <- arrange(coef.sort, pval)
head(coef.sort, n = 20)## Variable Estimate SE tval pval
## 1 HealthGood 1.348370701 0.08241119 16.361499 3.601510e-60
## 2 HealthFair 1.477199284 0.10563910 13.983453 1.967091e-44
## 3 HealthVery Good 0.766727728 0.07836307 9.784299 1.315034e-22
## 4 HealthPoor 1.140379388 0.16231813 7.025582 2.131752e-12
## 5 AsthmaNo -0.379424988 0.06758017 -5.614443 1.971967e-08
## 6 (Intercept) -2.116327055 0.39040345 -5.420872 5.930907e-08
## 7 Age 0.008571241 0.00188828 4.539180 5.647344e-06
## 8 RaceBlack 0.348432798 0.07712853 4.517561 6.255611e-06
## 9 DentistMore than one year 0.234693584 0.06585853 3.563602 3.658008e-04
## 10 EmploymentRetired -0.352015534 0.10529787 -3.343045 8.286437e-04
## 11 RaceAsian -1.035357671 0.32503215 -3.185401 1.445537e-03
## 12 MedANo 0.353660840 0.13517001 2.616415 8.885846e-03
## 13 MaritalLiving with a partner 0.324644453 0.13223552 2.455047 1.408661e-02
## 14 EducationSome college 0.340596777 0.14734178 2.311610 2.079917e-02
## 15 SexIdentSomething else -0.965556747 0.42137985 -2.291417 2.193933e-02
## 16 SexFemale -0.124530056 0.05533626 -2.250424 2.442204e-02
## 17 Income$13,800/$14,000 to under $15,500/$15,700 0.661988788 0.31370074 2.110256 3.483633e-02
## 18 UsualHospital outpatient clinic -0.277785405 0.13304027 -2.087980 3.679966e-02
## 19 UsualOther place -0.429605057 0.20671431 -2.078255 3.768588e-02
## 20 InsuredNo -0.370044068 0.18370493 -2.014339 4.397392e-02
The top ten variables from glm were:
obesity.glm.top <- glm(Obesity ~ Health + Asthma + Age + Race + Dentist + Employment + MedA + Marital + Education + SexIdent, data = obesity.data, family = binomial(logit))
glm.pred <- predict(obesity.glm.top, obesity.data, type="response")
N = nrow(obesity.data)
K = 10
set.seed(1234)
s = sample(1:K, size=N, replace=T)
pred.outputs.glm <- vector(mode="numeric", length=N)
pred.outputs.rf <- vector(mode="numeric", length=N)
obs.outputs <- vector(mode="numeric", length=N)
offset <- 0
for(i in 1:K){
train <- filter(obesity.data, s != i)
test <- filter(obesity.data, s == i)
obs.outputs[1:length(s[s==i]) + offset] <- test$Obesity
#GLM train/test
glm <- glm(Obesity ~ Health + Asthma + Age + Race + Dentist + Employment + MedA + Marital + Education + SexIdent, data=train, family=binomial(logit))
glm.pred.curr <- predict(glm, test, type="response")
pred.outputs.glm[1:length(s[s==i]) + offset] <- glm.pred.curr
#RF train/test
rf <- randomForest(Obesity ~ Age + Income + Health + County + Education + Marital + Kids + Adults + Race + Employment, data=train, ntree=100)
rf.pred.curr <- predict(rf, newdata=test, type="prob")
pred.outputs.rf[1:length(s[s==i]) + offset] <- rf.pred.curr[,2]
offset <- offset + length(s[s==i])
}
roc(obesity.data$Obesity, glm.pred, ci = TRUE)##
## Call:
## roc.default(response = obesity.data$Obesity, predictor = glm.pred, ci = TRUE)
##
## Data: glm.pred in 5833 controls (obesity.data$Obesity Non-Obese) < 2266 cases (obesity.data$Obesity Obese).
## Area under the curve: 0.7076
## 95% CI: 0.6954-0.7199 (DeLong)
roc(obs.outputs, pred.outputs.glm, ci = TRUE)##
## Call:
## roc.default(response = obs.outputs, predictor = pred.outputs.glm, ci = TRUE)
##
## Data: pred.outputs.glm in 5833 controls (obs.outputs 1) < 2266 cases (obs.outputs 2).
## Area under the curve: 0.6986
## 95% CI: 0.6863-0.7109 (DeLong)
roc(obesity.data$Obesity, rf.pred[,1], ci = TRUE)##
## Call:
## roc.default(response = obesity.data$Obesity, predictor = rf.pred[, 1], ci = TRUE)
##
## Data: rf.pred[, 1] in 5833 controls (obesity.data$Obesity Non-Obese) > 2266 cases (obesity.data$Obesity Obese).
## Area under the curve: 0.9986
## 95% CI: 0.9979-0.9993 (DeLong)
roc(obs.outputs, pred.outputs.rf, ci = TRUE)##
## Call:
## roc.default(response = obs.outputs, predictor = pred.outputs.rf, ci = TRUE)
##
## Data: pred.outputs.rf in 5833 controls (obs.outputs 1) < 2266 cases (obs.outputs 2).
## Area under the curve: 0.6772
## 95% CI: 0.6648-0.6897 (DeLong)
plot.roc(obesity.data$Obesity, glm.pred, col = "coral")
plot.roc(obs.outputs, pred.outputs.glm, col = "deepskyblue", add = TRUE)
plot.roc(obesity.data$Obesity, rf.pred[,1], col="darkorchid", add = TRUE)
plot.roc(obs.outputs, pred.outputs.rf, col="mediumseagreen", add=TRUE)
legend("bottomright",
legend=c("GLM Training", "GLM Cross-Validation", "RF Training", "RF Cross-Validation"),
col=c("coral", "deepskyblue", "darkorchid", "mediumseagreen"), lwd=1)The AUC for the models were:
I also wanted to see if there was a difference between adults and children, so I separted the data by respondant type and performed the same analysis as previously.
child.obesity.data <- data %>%
filter(FLAGCOM == "Child") %>%
select(Obesity = OBESITY2, County = COUNTY, Sex = SEX, Age = AGE, Kids = TOTKIDS, Adults = NUMADULT, Health = HEALTH, Asthma = EVRASTH, Usual = USUALGO, Work = WORKCOM, Self = SELFCOM, MedA = MEDACOM, MedB = MEDBCOM, MA = MACOM, VA = VACOM, Other = OTHERCOM, Insured = INSURED, Dentist = DENTIST, Obesity = OBESITY2, Smoking = SMOKHOME, Employment = MAINEMPL, Education = RSPGRAD2, Marital = RESPMAR, SexIdent = SEXIDENT, Race = RACE2, Income = INCOME) %>%
filter(complete.cases(.)) %>%
mutate(Obesity = fct_recode(Obesity, "Non-Obese" = "Underweight")) %>%
mutate(Obesity = fct_recode(Obesity, "Non-Obese" = "Normal")) %>%
mutate(Obesity = fct_recode(Obesity, "Non-Obese" = "Overweight"))
# random forest
child.obesity.rf <- randomForest(Obesity ~ ., data = child.obesity.data, importance = TRUE)
importance <- data.frame(child.obesity.rf$importance)
gini <- importance %>%
rownames_to_column("variable") %>%
arrange(desc(MeanDecreaseGini))
head(gini, n = 10)## variable Non.Obese Obese MeanDecreaseAccuracy MeanDecreaseGini
## 1 Income 0.0153539296 0.0142592766 0.0151723661 87.11472
## 2 Age 0.0051132113 0.0217663766 0.0078159236 52.01621
## 3 Education 0.0063196199 0.0095553170 0.0068533396 34.58930
## 4 Health 0.0058235693 0.0205955937 0.0082377298 34.33234
## 5 County 0.0011954064 0.0067714310 0.0021030403 32.08490
## 6 Kids 0.0007308595 0.0018609799 0.0009195634 24.28021
## 7 Marital 0.0042233048 -0.0004837701 0.0034611150 23.18928
## 8 Race 0.0044912730 0.0021706454 0.0040926524 22.04800
## 9 Adults 0.0020377475 -0.0038687373 0.0010744458 19.76558
## 10 Asthma 0.0026255846 0.0070700737 0.0033648849 12.49906
The top ten variables from random forest for children were:
child.obesity.rf.top <- randomForest(Obesity ~ Income + Age + Education + County + Health + Kids + Marital + Race + Adults + Asthma, data = child.obesity.data, importance = TRUE)
child.rf.pred <- predict(child.obesity.rf.top, child.obesity.data, type="prob")
# glm
child.obesity.glm <- glm(Obesity ~ ., data = child.obesity.data, family = binomial(logit))
coef <- summary(child.obesity.glm)[12]
coef.sort <- as.data.frame(coef)
coef.sort <- setDT(coef.sort, keep.rownames = TRUE)[]
names(coef.sort) <- c("Variable", "Estimate","SE","tval","pval")
coef.sort <- arrange(coef.sort, pval)
head(coef.sort, n = 25)## Variable Estimate SE tval pval
## 1 Age -0.1605368 0.02260187 -7.102811 1.222450e-12
## 2 HealthGood 1.0320349 0.20800900 4.961492 6.995386e-07
## 3 HealthFair 1.7176299 0.39584298 4.339170 1.430220e-05
## 4 AsthmaNo -0.5403560 0.16907222 -3.196007 1.393438e-03
## 5 SexFemale -0.4658469 0.14852340 -3.136522 1.709646e-03
## 6 HealthVery Good 0.5122784 0.17298608 2.961385 3.062585e-03
## 7 MaritalLiving with a partner 0.8144577 0.33829025 2.407571 1.605905e-02
## 8 Income$75,000 to under $100,000 -1.6341018 0.69118576 -2.364201 1.806902e-02
## 9 Income$250,000 or more -1.6450597 0.73899854 -2.226066 2.600974e-02
## 10 Income$150,000 to under $250,000 -1.4935607 0.69239247 -2.157101 3.099777e-02
## 11 CountyChester -0.5632715 0.28275805 -1.992062 4.636430e-02
## 12 SelfNo 0.2922436 0.15077956 1.938217 5.259669e-02
## 13 Income$29,300/$29,750 to under $31,200/$31,900 -1.6547405 0.86981284 -1.902410 5.711759e-02
## 14 EmploymentRetired -3.1024600 1.64424819 -1.886856 5.917966e-02
## 15 Income$27,600/$28,000 to under $29,300/$29,750 -1.7516617 0.97228479 -1.801593 7.160943e-02
## 16 Income$55,000/$55,400 to under $60,000/$60,000 -1.5476525 0.87625497 -1.766213 7.736020e-02
## 17 Income$100,000 to under $150,000 -1.1224344 0.67372750 -1.666007 9.571211e-02
## 18 SexIdentSomething else 3.1873632 1.91759685 1.662165 9.647960e-02
## 19 InsuredNo -1.0207388 0.64929723 -1.572067 1.159350e-01
## 20 Income$5,750/$5,850 to under $7,750/$7,850 -1.9483382 1.34704730 -1.446377 1.480715e-01
## 21 Income$39,100/$40,000 to under $41,400/$41,900 -1.4361522 1.02721566 -1.398102 1.620825e-01
## 22 MaritalSeparated -0.7559577 0.54646623 -1.383357 1.665555e-01
## 23 RaceBlack 0.3134588 0.22681875 1.381979 1.669781e-01
## 24 Income$13,800/$14,000 to under $15,500/$15,700 -1.6891705 1.29546672 -1.303909 1.922647e-01
## 25 EmploymentFull-time student/Job training -1.7556245 1.36297975 -1.288078 1.977188e-01
The top ten variables from glm for children were:
child.obesity.glm.top <- glm(Obesity ~ Age + Health + Asthma + Sex + Marital + Income + Self + Employment + SexIdent + Insured, data = child.obesity.data, family = binomial(logit))
child.glm.pred <- predict(child.obesity.glm.top, child.obesity.data, type="response")
N = nrow(child.obesity.data)
K = 10
set.seed(1234)
s = sample(1:K, size=N, replace=T)
child.pred.outputs.glm <- vector(mode="numeric", length=N)
child.pred.outputs.rf <- vector(mode="numeric", length=N)
child.obs.outputs <- vector(mode="numeric", length=N)
offset <- 0
for(i in 1:K){
train <- filter(child.obesity.data, s != i)
test <- filter(child.obesity.data, s == i)
child.obs.outputs[1:length(s[s==i]) + offset] <- test$Obesity
#GLM train/test
glm <- glm(Obesity ~ Age + Health + Asthma + Sex + Marital + Income + Self + Employment + SexIdent + Insured, data=train, family=binomial(logit))
glm.pred.curr <- predict(glm, test, type="response")
child.pred.outputs.glm[1:length(s[s==i]) + offset] <- glm.pred.curr
#RF train/test
rf <- randomForest(Obesity ~ Income + Age + Education + County + Health + Kids + Marital + Race + Adults + Asthma, data=train, ntree=100)
rf.pred.curr <- predict(rf, newdata=test, type="prob")
child.pred.outputs.rf[1:length(s[s==i]) + offset] <- rf.pred.curr[,2]
offset <- offset + length(s[s==i])
}
roc(child.obesity.data$Obesity, child.glm.pred, ci = TRUE)##
## Call:
## roc.default(response = child.obesity.data$Obesity, predictor = child.glm.pred, ci = TRUE)
##
## Data: child.glm.pred in 1529 controls (child.obesity.data$Obesity Non-Obese) < 301 cases (child.obesity.data$Obesity Obese).
## Area under the curve: 0.7894
## 95% CI: 0.7619-0.817 (DeLong)
roc(child.obs.outputs, child.pred.outputs.glm, ci = TRUE)##
## Call:
## roc.default(response = child.obs.outputs, predictor = child.pred.outputs.glm, ci = TRUE)
##
## Data: child.pred.outputs.glm in 1529 controls (child.obs.outputs 1) < 301 cases (child.obs.outputs 2).
## Area under the curve: 0.7244
## 95% CI: 0.6921-0.7567 (DeLong)
roc(child.obesity.data$Obesity, child.rf.pred[,1], ci = TRUE)##
## Call:
## roc.default(response = child.obesity.data$Obesity, predictor = child.rf.pred[, 1], ci = TRUE)
##
## Data: child.rf.pred[, 1] in 1529 controls (child.obesity.data$Obesity Non-Obese) > 301 cases (child.obesity.data$Obesity Obese).
## Area under the curve: 0.9986
## 95% CI: 0.9972-1 (DeLong)
roc(child.obs.outputs, child.pred.outputs.rf, ci = TRUE)##
## Call:
## roc.default(response = child.obs.outputs, predictor = child.pred.outputs.rf, ci = TRUE)
##
## Data: child.pred.outputs.rf in 1529 controls (child.obs.outputs 1) < 301 cases (child.obs.outputs 2).
## Area under the curve: 0.7066
## 95% CI: 0.6744-0.7387 (DeLong)
plot.roc(child.obesity.data$Obesity, child.glm.pred, col = "coral")
plot.roc(child.obs.outputs, child.pred.outputs.glm, col = "deepskyblue", add = TRUE)
plot.roc(child.obesity.data$Obesity, child.rf.pred[,1], col="darkorchid", add = TRUE)
plot.roc(child.obs.outputs, child.pred.outputs.rf, col="mediumseagreen", add=TRUE)
legend("bottomright",
legend=c("GLM Training", "GLM Cross-Validation", "RF Training", "RF Cross-Validation"),
col=c("coral", "deepskyblue", "darkorchid", "mediumseagreen"), lwd=1)The AUC for the models were:
adult.obesity.data <- data %>%
filter(FLAGCOM == "Adult") %>%
select(Obesity = OBESITY2, County = COUNTY, Sex = SEX, Age = AGE, Kids = TOTKIDS, Adults = NUMADULT, Health = HEALTH, Asthma = EVRASTH, Usual = USUALGO, Work = WORKCOM, Self = SELFCOM, MedA = MEDACOM, MedB = MEDBCOM, MA = MACOM, VA = VACOM, Other = OTHERCOM, Insured = INSURED, Dentist = DENTIST, Obesity = OBESITY2, Smoking = SMOKHOME, Employment = MAINEMPL, Education = RSPGRAD2, Marital = RESPMAR, SexIdent = SEXIDENT, Race = RACE2, Income = INCOME) %>%
filter(complete.cases(.)) %>%
mutate(Obesity = fct_recode(Obesity, "Non-Obese" = "Underweight")) %>%
mutate(Obesity = fct_recode(Obesity, "Non-Obese" = "Normal")) %>%
mutate(Obesity = fct_recode(Obesity, "Non-Obese" = "Overweight"))
# random forest
adult.obesity.rf <- randomForest(Obesity ~ ., data = adult.obesity.data, importance = TRUE)
importance <- data.frame(adult.obesity.rf$importance)
gini <- importance %>%
rownames_to_column("variable") %>%
arrange(desc(MeanDecreaseGini))
head(gini, n = 10)## variable Non.Obese Obese MeanDecreaseAccuracy MeanDecreaseGini
## 1 Income 2.464178e-02 -1.644699e-02 0.0117284846 425.76153
## 2 Age 1.669662e-02 -5.324369e-03 0.0097822595 319.97189
## 3 Health 2.200059e-02 2.733470e-02 0.0236675918 216.18614
## 4 Education 1.301671e-02 -4.772839e-03 0.0074197266 162.07644
## 5 County -9.847389e-05 -1.491345e-03 -0.0005410109 155.51560
## 6 Marital 6.987936e-03 -2.049963e-03 0.0041536409 136.67958
## 7 Adults 3.374632e-03 -1.860507e-03 0.0017297277 118.63535
## 8 Kids 4.937940e-03 -3.111993e-03 0.0024077995 111.00460
## 9 Employment 1.037812e-02 -6.915022e-03 0.0049545151 101.45155
## 10 Race 8.885691e-03 3.781615e-05 0.0061056536 86.94754
The top ten variables from random forest for adults were:
adult.obesity.rf.top <- randomForest(Obesity ~ Income + Age + Health + Education + County + Marital + Adults + Kids + Employment + Race, data = adult.obesity.data, importance = TRUE)
adult.rf.pred <- predict(adult.obesity.rf.top, adult.obesity.data, type="prob")
# glm
adult.obesity.glm <- glm(Obesity ~ ., data = adult.obesity.data, family = binomial(logit))
coef <- summary(adult.obesity.glm)[12]
coef.sort <- as.data.frame(coef)
coef.sort <- setDT(coef.sort, keep.rownames = TRUE)[]
names(coef.sort) <- c("Variable", "Estimate","SE","tval","pval")
coef.sort <- arrange(coef.sort, pval)
head(coef.sort, n = 20)## Variable Estimate SE tval pval
## 1 HealthGood 1.4461474 0.09636849 15.006434 6.663477e-51
## 2 HealthFair 1.5559778 0.11775071 13.214170 7.268390e-40
## 3 HealthVery Good 0.8451756 0.09358468 9.031133 1.699037e-19
## 4 HealthPoor 1.3095187 0.17318565 7.561358 3.988824e-14
## 5 AsthmaNo -0.3740692 0.07605725 -4.918258 8.731766e-07
## 6 RaceBlack 0.3692622 0.08338249 4.428534 9.487583e-06
## 7 (Intercept) -1.9024318 0.44751104 -4.251139 2.126857e-05
## 8 DentistMore than one year 0.2438354 0.06855360 3.556857 3.753178e-04
## 9 RaceAsian -1.5160284 0.43284358 -3.502485 4.609389e-04
## 10 UsualHospital outpatient clinic -0.4058594 0.14570116 -2.785560 5.343529e-03
## 11 Income$13,800/$14,000 to under $15,500/$15,700 0.8820372 0.33223519 2.654858 7.934180e-03
## 12 Income$75,000 to under $100,000 0.6099591 0.25744866 2.369246 1.782441e-02
## 13 SexIdentSomething else -0.9782676 0.44670217 -2.189977 2.852589e-02
## 14 Income$7,750/$7,850 to under $9,750/$9,900 0.5998652 0.28579896 2.098906 3.582518e-02
## 15 MedANo 0.2936879 0.14184805 2.070440 3.841113e-02
## 16 EmploymentRetired -0.2329424 0.11289807 -2.063299 3.908424e-02
## 17 UsualOther place -0.4215336 0.21192100 -1.989107 4.668936e-02
## 18 Income$31,200/$31,900 to under $35,400/$36,000 0.5270726 0.27853937 1.892273 5.845460e-02
## 19 Income$19,600/$20,000 to under $23,500/$23,750 0.4876175 0.27601337 1.766644 7.728779e-02
## 20 EducationSome college 0.2679400 0.15418461 1.737787 8.224836e-02
The top ten variables from glm for adults were:
adult.obesity.glm.top <- glm(Obesity ~ Health + Asthma + Race + Dentist + Usual + Marital + Income + SexIdent + MedA + Employment, data = adult.obesity.data, family = binomial(logit))
adult.glm.pred <- predict(adult.obesity.glm.top, adult.obesity.data, type="response")
N = nrow(adult.obesity.data)
K = 10
set.seed(1234)
s = sample(1:K, size=N, replace=T)
adult.pred.outputs.glm <- vector(mode="numeric", length=N)
adult.pred.outputs.rf <- vector(mode="numeric", length=N)
adult.obs.outputs <- vector(mode="numeric", length=N)
offset <- 0
for(i in 1:K){
train <- filter(adult.obesity.data, s != i)
test <- filter(adult.obesity.data, s == i)
adult.obs.outputs[1:length(s[s==i]) + offset] <- test$Obesity
#GLM train/test
glm <- glm(Obesity ~ Health + Asthma + Race + Dentist + Usual + Marital + Income + SexIdent + MedA + Employment, data=train, family=binomial(logit))
glm.pred.curr <- predict(glm, test, type="response")
adult.pred.outputs.glm[1:length(s[s==i]) + offset] <- glm.pred.curr
#RF train/test
rf <- randomForest(Obesity ~ Income + Age + Health + Education + County + Marital + Adults + Kids + Employment + Race, data=train, ntree=100)
rf.pred.curr <- predict(rf, newdata=test, type="prob")
adult.pred.outputs.rf[1:length(s[s==i]) + offset] <- rf.pred.curr[,2]
offset <- offset + length(s[s==i])
}
roc(adult.obesity.data$Obesity, adult.glm.pred, ci = TRUE)##
## Call:
## roc.default(response = adult.obesity.data$Obesity, predictor = adult.glm.pred, ci = TRUE)
##
## Data: adult.glm.pred in 4304 controls (adult.obesity.data$Obesity Non-Obese) < 1965 cases (adult.obesity.data$Obesity Obese).
## Area under the curve: 0.6882
## 95% CI: 0.6744-0.7019 (DeLong)
roc(adult.obs.outputs, adult.pred.outputs.glm, ci = TRUE)##
## Call:
## roc.default(response = adult.obs.outputs, predictor = adult.pred.outputs.glm, ci = TRUE)
##
## Data: adult.pred.outputs.glm in 4304 controls (adult.obs.outputs 1) < 1965 cases (adult.obs.outputs 2).
## Area under the curve: 0.669
## 95% CI: 0.655-0.683 (DeLong)
roc(adult.obesity.data$Obesity, adult.rf.pred[,1], ci = TRUE)##
## Call:
## roc.default(response = adult.obesity.data$Obesity, predictor = adult.rf.pred[, 1], ci = TRUE)
##
## Data: adult.rf.pred[, 1] in 4304 controls (adult.obesity.data$Obesity Non-Obese) > 1965 cases (adult.obesity.data$Obesity Obese).
## Area under the curve: 0.9992
## 95% CI: 0.9987-0.9997 (DeLong)
roc(adult.obs.outputs, adult.pred.outputs.rf, ci = TRUE)##
## Call:
## roc.default(response = adult.obs.outputs, predictor = adult.pred.outputs.rf, ci = TRUE)
##
## Data: adult.pred.outputs.rf in 4304 controls (adult.obs.outputs 1) < 1965 cases (adult.obs.outputs 2).
## Area under the curve: 0.6477
## 95% CI: 0.6335-0.6619 (DeLong)
plot.roc(adult.obesity.data$Obesity, adult.glm.pred, col = "coral")
plot.roc(adult.obs.outputs, adult.pred.outputs.glm, col = "deepskyblue", add = TRUE)
plot.roc(adult.obesity.data$Obesity, adult.rf.pred[,1], col="darkorchid", add = TRUE)
plot.roc(adult.obs.outputs, adult.pred.outputs.rf, col="mediumseagreen", add=TRUE)
legend("bottomright",
legend=c("GLM Training", "GLM Cross-Validation", "RF Training", "RF Cross-Validation"),
col=c("coral", "deepskyblue", "darkorchid", "mediumseagreen"), lwd=1)The AUC for the models were:
In conclusion, the prediction models created were not as accurate as I would have liked, but I did uncover some interesting predictors for obesity.
To recap the top predictors for children and adults:
Children
Adults
Socioeconomic status and race has been shown to affect obesity, so I wasn’t surprised to see Income, Education, Employment, self insured (Self), County, and Race. Access to care is also important in influencing obesity since those who have access to healthcare professionals and abilty to pay for services would more likely lead a healthier life - this includes whether the respondant is insured (Insured) and where they usually go to for care (Usual). Other health outcomes may be confounding factors - health status (Health), Asthma, and whether the respondant has been to the dentist in the past year (Dentist). Also, since only those who are 65 or older or disabled are eligible for Medicare Part A, MedA is also likely a confounding health outcome factor.
I did some additional research to try to explain to remaining variables. Age is ranked highly as a predictor of obesity and the current throught is that once someone becomes obese it’s very difficult to return to a normal weight category and will likely remain obese for the rest of their life. Sex is ranked highly as a predictor in children and females are more likely to not be obese and studies have shown that it may be due to societal standards of being skinny equating to beauty. Women and especially impressionable young girls are likely to undergo extreme diets and may suffer from eating disorders to maintain this “ideal” weight. Household size (Kids and Adults) may influence obesity since those growing up with more siblings and adult figures in their life will tend towards playing with their family members instead of being glued to electonics like so many people are today. Lastly marital status (Marital) and sexual identity (SexIdent) may be a source of stress and anxiety and lead to obesity.
Identifying these predictors is only half the battle. Obesity rates will continue to rise unless healthcare policymakers take action and constantly reevaluate their initiatives for effectiveness. Access to care continues to be an important factor and it’s crucial that healthcare is available to everyone regardless of income or employment status. In addition, I think it’s important that individuals be treated as individuals that come with their own health history and emotional baggage and unique health programs are created to suit one’s needs.